ABSTRACT

We present techniques for the estimation of stellar atmospheric parameters (T_eff, log g, [Fe/H]) for stars from the SDSS/SEGUE
survey. The atmospheric parameters are derived from the observed medium-resolution (R = 2000) stellar spectra using non-linear 
regression models trained either on (1) pre-classified observed data or (2) synthetic stellar spectra. In the first case we use our models 
to automate and generalize parametrization produced by a preliminary version of the SDSS/SEGUE Spectroscopic Parameter Pipeline 
(SSPP). In the second case we directly model the mapping between synthetic spectra (derived from Kurucz model atmospheres) and 
the atmospheric parameters, independently of any intermediate estimates. After training, we apply our models to various samples of 
SDSS spectra to derive atmospheric parameters, and compare our results with those obtained previously by the SSPP for the same 
samples. We obtain consistency between the two approaches, with RMS deviations on the order of 150 K in T_eff, 0.35 dex in log g,
and 0.22 dex in [Fe/H]. 

The models are applied to pre-processed spectra, either via Principal Components Analysis (PCA) or a Wavelength Range Selection 
(WRS) method, which employs a subset of the full 3850-9000 Å spectral range. This is both for computational reasons (robustness and
speed), and because it delivers higher accuracy (better generalization of what the models have learned). Broadly speaking, the PCA 
is demonstrated to deliver more accurate atmospheric parameters when the training data are the actual SDSS spectra with previously 
estimated parameters, whereas WRS appears superior for the estimation of log g via synthetic templates, especially for lower
signal-to-noise spectra. From a subsample of some 19 000 stars with previous determinations of the atmospheric parameters, the accuracies of
our predictions (mean absolute errors) for each parameter are T_eff to 170/170 K, log g to 0.36/0.45 dex, and [Fe/H] to 0.19/0.26 dex,
for methods (1) and (2), respectively. We measure the intrinsic errors of our models by training on synthetic spectra and evaluating
their performance on an independent set of synthetic spectra. This yields RMS accuracies of 50 K, 0.02 dex, and 0.03 dex on T_eff,
log g, and [Fe/H], respectively. 

Our approach can be readily deployed in an automated analysis pipeline, and can easily be retrained as improved stellar models and 
synthetic spectra become available. We nonetheless emphasise that this approach relies on an accurate calibration and pre-processing 
of the data (to minimize mismatch between the real and synthetic data), as well as sensible choices concerning feature selection. 
From an analysis of cluster candidates with available SDSS spectroscopy (M 15, M 13, M 2, and NGC 2420), and assuming the age, 
metallicity, and distances given in the literature are correct, we find evidence for small systematic offsets in T_eff and/or log g for the
parameter estimates from the model trained on real data with the SSPP. Thus, this model turns out to derive more precise, but less 
accurate, atmospheric parameters than the model trained on synthetic data. 

Key words. Astronomical data bases: Surveys - Methods: data analysis - Methods: statistical - Stars: fundamental parameters - 
Galaxy: globular/open clusters individual: M 15, M 13, M 2/NGC 2420 



1. Introduction 

The nature of the stellar populations of the Milky Way galaxy 
remains an important issue for astrophysics, because it addresses 
the question of galaxy formation and evolution and the evolution 
of the chemical elements. To date, however, studies of the stellar 
populations, kinematics, and chemical abundances in the Galaxy 
have mostly been limited by small number statistics. 

Fortunately, this state of affairs is rapidly changing. The
Sloan Digital Sky Survey (SDSS; York et al. 2000) has imaged
over 8000 square degrees of the northern Galactic cap
(above |b| = 40°) in the ugriz photometric system for some
100 million stars. Imaging data are produced simultaneously
(Fukugita et al. 1996; Gunn et al. 1998, 2006; Hogg et al. 2001;
Abazajian et al. 2005; Adelman-McCarthy et al. 2007) and
processed through pipelines to measure photometric and
astrometric properties (Lupton et al. 1987; Stoughton et al.
2002; Smith et al. 2002; Tucker et al. 2002; Pier et al. 2003;
Ivezic et al. 2004) and to select targets for spectroscopic
follow-up. Of even greater importance, some 200 000
medium-resolution stellar spectra have been obtained during
the course of SDSS-I (the original survey).

The SDSS-II project, which includes SEGUE (Sloan 
Extension for Galactic Understanding and Exploration), is ob- 
taining some 3500 square degrees of additional imaging data 
at lower Galactic latitudes, in order to better explore the
interface between the thick-disk and halo populations between
0.5 - 4 kpc from the Galactic plane. SEGUE will obtain some 
250000 medium-resolution spectra of stars in the Galaxy in 
the magnitude range 14.0 < g < 20.5 in 200 fields cover- 
ing the sky visible from the northern hemisphere (Apache Point 
Observatory, New Mexico). The targets are selected based on the 
photometry, and are chosen to provide tracers of the structure, 
chemical evolution, and stellar content of the Milky Way from 
0.5 to 100 kpc from the Sun. Taken together, the stellar database 
from SDSS-I and SEGUE provides an unprecedented
opportunity for developing a better understanding of the properties of the
Milky Way. 

Of special importance to achieve these goals is the deter- 
mination of intrinsic stellar physical properties, such as masses, 
ages, and elemental abundances. The first step toward
achieving this goal is to estimate the atmospheric parameters
for these stars. A number of early studies (e.g., Gulati et al.
1996; Bailer-Jones et al. 1997; Bailer-Jones, Irwin & von Hippel
1998; Bailer-Jones 2000; Snider et al. 2001; Willemsen et al.
2005) have
demonstrated that non-linear regression models can be robust 
and precise classifiers of stellar spectra, either when trained on 
pre-classified observed data or on synthetic stellar spectra. In 
this paper we further explore the capability of these techniques 
to estimate T_eff, log g, and [Fe/H] specifically for SDSS/SEGUE
spectroscopy and photometry. Alternative procedures are
described by Allende Prieto et al. (2006), Lee et al. (2006), and
Lee et al. (2007).

In this paper we explore three approaches in which either 
synthetic ('S') or real ('R') data are used for training and/or 
testing. With SS (training and testing on synthetic data), es- 
timates of the atmospheric parameters are obtained from the 
model spectra, and the application is merely a test of the limits 
of the pre-processing/regression model. In RR (training and test- 
ing on real data), we use a set of pre-parametrized SEGUE spec- 
tra, in this case from a preliminary version of the SDSS/SEGUE 
Spectroscopic Parameter Pipeline (SSPP). Our model automates 
and, more importantly, generalizes these parametrizations. The 
model performance is evaluated on a separate set of data ob- 
tained from SDSS/SEGUE. SR is a model trained on synthetic 
data and applied to real data, thus allowing us to directly deter- 
mine the atmospheric parameters without using an intermediate 
parametrization model. As we have no definitive "true" values 
against which to compare our parametrizations, we instead com- 
pare the results of the SR and RR models to parameters esti- 
mated by the SSPP (on a set of data not used to train RR). Of 
course, in both the SR and RR cases the derived parameters are 
based on a set of model atmospheres - the difference is how the 
atmospheric parameters are derived from them. 

The layout of this paper is as follows. In Sect. 2 we de- 
scribe the spectroscopic and photometric data from which pre- 
liminary estimates of the atmospheric parameters were obtained. 
Our regression model is described in Sect. 3. In Sect. 4 we
discuss the advantages of dimensionality reduction via Principal
Component Analysis, as well as from wavelength ("feature")
selection. The results of the application of our methods using the
SS, RR, and SR approaches are discussed in Sect. 5. An
independent assessment of the accuracy (and calibration) of our
models is provided in Sect. 6, where we estimate the atmospheric
parameters of stars in several Galactic globular and open clusters.
Finally, in Sect. 7 we provide our conclusions.



2. Data 

In this section we discuss the SDSS/SEGUE spectra and the syn- 
thetic spectra that were constructed in order to build our models. 

2.1. Sample of real spectra 

Stellar spectra from SDSS/SEGUE cover the wavelength range
3850-9000 Å at a resolving power R = λ/Δλ ≈ 2000. The
spectra are wavelength calibrated and approximately flux
corrected using procedures described in Stoughton et al. (2002).
For the purpose of our work, we first rebin to a final dispersion
of 1.0 Å/pixel in the blue region 3850-6000 Å, and 1.5 Å/pixel in
the red region 6000-9000 Å. Since the spectrophotometric
corrections applied to these spectra are only approximate, we
remove the continuum via an automated, iterative procedure
(described in Sect. 2.2).
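
The rebinning step can be illustrated with a short sketch. The helper names (`make_grid`, `rebin`) and the use of linear interpolation are assumptions made here for illustration only; the resampling scheme actually applied to the SDSS/SEGUE spectra is not specified in the text.

```python
from bisect import bisect_right

def make_grid(start, stop, step):
    """Evenly spaced wavelength grid in Angstrom, including both end points."""
    n = int(round((stop - start) / step)) + 1
    return [start + i * step for i in range(n)]

def rebin(wave, flux, grid):
    """Resample (wave, flux) onto grid by linear interpolation (assumed scheme).

    Assumes wave is strictly increasing; grid points outside the observed
    range are linearly extrapolated from the nearest pair of pixels.
    """
    out = []
    for w in grid:
        # index of the right-hand neighbour, clamped to a valid pixel pair
        j = min(max(bisect_right(wave, w), 1), len(wave) - 1)
        w0, w1 = wave[j - 1], wave[j]
        t = (w - w0) / (w1 - w0)
        out.append(flux[j - 1] + t * (flux[j] - flux[j - 1]))
    return out

blue = make_grid(3850.0, 6000.0, 1.0)   # blue region, 1.0 A/pixel
red = make_grid(6000.0, 9000.0, 1.5)    # red region, 1.5 A/pixel
```

With both end points included, these two grids hold 2151 and 2001 bins, i.e. 4152 bins over 3850-9000 Å.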

We have selected a sample of 38 731 stellar spectra for stars 
in regions of low reddening, and for which atmospheric param- 
eter estimates of effective temperature, gravity, and metallicity 
(T_eff, log g, [Fe/H]) have been obtained previously using the
combination of procedures described in the SSPP (Lee et al.
2007), including several that rely on the available ugriz
photometry. These methods include chi-square minimization with
respect to synthetic spectral templates, neural networks, autocor- 
relation analysis, and a variety of line index calculations based 
on previous calibrations with respect to known standard stars. 
Estimates of the likely external errors in spectroscopic parame- 
ter determinations are in the process of being obtained by com- 
parison with a number of previously available stellar spectro- 
scopic libraries, as well as with high-resolution spectroscopy of 
over 100 SDSS/SEGUE stars. The use of multiple methods al- 
lows for empirical determinations of the internal errors for each 
parameter. However, we remark that at present the parameters 
from SSPP are inhomogeneously assembled, in the sense that 
we are still in the process of exploring which techniques are op- 
timal over the parameter ranges which we study. This situation 
will change in the near future, when the techniques involved in 
the SSPP can be evaluated more fully, and are used to produce a 
meaningful weighted average. 

Radial velocities estimated by the SSPP are used to reduce 
all spectra to a common radial velocity zero point. 



2.2. Sample of synthetic spectra 

In recent years a number of new atmospheric models
covering a wide range of atmospheric parameters have become
available. Here we make use of a set of 1816 synthetic spectra
calculated from Kurucz's NEWODF models (Castelli &
Kurucz 2003) with solar abundances by Asplund, Grevesse &
Sauval (2005), including H2O opacities, an improved set of
TiO lines, and no convective overshoot (Castelli, Gratton &
Kurucz 1997). All pertinent molecular species are included in
these models, even those whose features have minor strength in
the wavelength range covered by the SDSS spectra. The
synthetic spectra are generated using the turbospectrum
synthesis code (Alvarez & Plez 1998), and employ line broadening
according to the prescription of Barklem & O'Mara (1998).
The linelists used come from a variety of sources. Updated
atomic lines are taken mainly from the VALD database (Kupka
et al. 1999). The molecular species CH, CN, and OH are
provided by B. Plez (see Plez & Cohen 2005), while the
NH and C2 molecules are from the Kurucz linelists (see
http://kurucz.harvard.edu/LINELISTS/LINESMOL/). Note that, at
present, the linelists used to generate the synthetic spectra do not
include all of the interesting molecular species, in particular, the
MgH and CaH features. We plan to include these molecules in
an updated version of our synthetic spectra, which is now under
construction.

Fig. 1. The grid of stellar atmospheric parameters T_eff, log g, and [Fe/H]. The synthetic parameters (plus symbols) are presented in
comparison with previously estimated atmospheric parameters (black dots) for 38 731 SDSS/SEGUE spectra.

Our grids span the parameter ranges [3500, 10000] K in
T_eff (27 values, stepsize of 250 K), [0, 5] in log g (11 values
in 0.5 dex steps), and [-4.0, 0.0] in [Fe/H] (7 values, stepsize
between 0.5 dex and 1.5 dex; there is a gap in the grid between
[Fe/H] = -2.5 and -4.0). The synthetic spectra are similarly
divided into blue and red regions, and the same dispersion
correction and flux "calibration" (i.e. instrument modeling) were
applied to match the real SDSS/SEGUE spectra. Figure 1 shows
the grid of the available parameters. The data used cover the
full input range provided, 3850-9000 Å, in 4152 individual data
bins. It should also be noted that we have not implemented any
procedure to account for the inevitable presence of telluric lines,
in particular near the location of the calcium triplet. At present,
new reduction procedures for SDSS spectra are being explored
to minimize the impact of telluric lines in this region.

The continuum is removed by dividing the spectrum by an
iterative fifth-order polynomial fit of the spectrum. This is done
separately for the blue and red regions. In the following we
exclude the red region 6000-6500 Å, because we found that the
synthetic spectra do not properly model the real ones there. This
discrepancy may be due in part to instrumental signatures in this
spectral region, which corresponds to the wavelengths where the
dichroic used in the dual-arm SDSS spectrographs splits the
incoming photons into the blue and red arms.
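
As an illustration of the continuum removal, the sketch below divides a spectrum by a fifth-order polynomial that is refitted after rejecting discrepant pixels. Only the polynomial order and the iterative character of the fit come from the text; the asymmetric clipping thresholds, the iteration limit, and the function name `remove_continuum` are assumptions, so this should be read as one plausible realisation rather than the procedure actually used.

```python
import numpy as np

def remove_continuum(wave, flux, order=5, n_iter=5, lo=2.0, hi=3.0):
    """Divide a spectrum by an iteratively refitted polynomial pseudo-continuum."""
    # rescale wavelengths to keep the polynomial fit well conditioned
    x = (wave - wave.mean()) / (wave.max() - wave.min())
    keep = np.ones(flux.size, dtype=bool)
    for _ in range(n_iter):
        coeffs = np.polyfit(x[keep], flux[keep], order)
        fit = np.polyval(coeffs, x)
        resid = flux - fit
        sigma = resid[keep].std()
        # assumed rejection rule: drop pixels well below the fit (absorption
        # lines) or well above it (spikes) before the next iteration
        new_keep = (resid > -lo * sigma) & (resid < hi * sigma)
        if new_keep.sum() <= order + 1 or np.array_equal(new_keep, keep):
            break
        keep = new_keep
    return flux / fit
```

In this sketch the function would be called once for the blue region and once for the red region, following the separate treatment described above.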



3. Non-linear regression model 

We implement a flexible method of regression that provides a
global non-linear mapping between a set of inputs (the stellar
spectrum x) and a set of outputs (the stellar atmospheric
parameters, s = {T_eff, log g, [Fe/H]}),

$$ s(p) = f\bigl(x(p);\, w\bigr) \qquad (1) $$

where p denotes the p-th flux vector (star) and w the set of
weights that characterise the regression model (Bailer-Jones
2000). To reduce the dynamical range of T_eff and to better
represent the uncertainties we use log T_eff. Furthermore, in order
to put all variables (s and x) on an equal footing, we set, for each
variable, the mean to zero and standard deviation to unity (a
linear conversion). This helps with the internal stability of most
machine learning algorithms.
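
The scaling of the variables can be sketched as follows. The temperatures are made-up illustrative values, the helper names are hypothetical, and the use of the population (rather than sample) standard deviation is an assumption.

```python
import math
from statistics import mean, pstdev

def standardize(values):
    """Shift to zero mean and scale to unit standard deviation (a linear conversion)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values], mu, sigma

def unstandardize(z, mu, sigma):
    """Invert standardize(), e.g. to map a model output back to physical units."""
    return [v * sigma + mu for v in z]

teff = [4500.0, 5750.0, 6000.0, 8000.0]      # illustrative values in K
log_teff = [math.log10(t) for t in teff]     # log T_eff compresses the dynamic range
z, mu, sigma = standardize(log_teff)
```

The same transformation would be applied to each input flux bin (or PCA coefficient) and to each output parameter, with `mu` and `sigma` stored so that predictions can be converted back.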

The free parameters, {w}, of the model are learned by error
minimization using sets of data for which inputs and their
corresponding outputs are known. This is an iterative procedure
in which patterns are presented to the model, the outputs
calculated, and the differences between these and the target
outputs are used to perturb the weights in a direction that reduces
the error. Learning is stopped once the rate of reduction of the
error drops below some threshold. Our error function comprises
two parts. The first term in the equation below is the
sum-of-squares error in the predictions (the likelihood), the second
is a regularization term,

$$ E_{\rm tot} = \frac{1}{2} \sum_{p} \sum_{l} \beta_l \bigl[ y(p)_l - T(p)_l \bigr]^2 + \frac{1}{2}\, \alpha \sum_{i} w_i^2 \qquad (2) $$

where, for each pattern p, $T(p)_l$ and $y(p)_l$ are the target value and
its estimate from the regression method for the l-th atmospheric
parameter, respectively. The model (hyper)parameters $\beta_l$ dictate
the relative importance of each parameter in the total error, and
$\alpha$ specifies the degree of regularization. In the present work
these hyperparameters were optimized via a brute force search
(conditioned by experience). We use a "committee" of ten
identical models trained on the same data, but starting from
different random initial weights. Estimates of the atmospheric
parameters obtained by the application of the model are the
average of the ten individual estimates. This simple approach helps
overcome "convergence" noise and, on average, increases the
accuracy obtained.
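
The two ingredients described above, a regularized error function and a committee average, can be sketched as follows. The quadratic weight-decay form of the regularization term and the factors of one half are assumptions (the text only states that a regularization term scaled by a hyperparameter is added to a weighted sum-of-squares error), and the function names are hypothetical.

```python
def total_error(targets, outputs, weights, beta, alpha):
    """Weighted sum-of-squares error plus an assumed quadratic weight-decay penalty.

    targets, outputs: one row per pattern p, one column per parameter l
    weights:          flat list of model weights
    beta:             per-parameter importance factors
    alpha:            regularization strength
    """
    sse = sum(
        b * (y - t) ** 2
        for t_row, y_row in zip(targets, outputs)   # loop over patterns p
        for b, t, y in zip(beta, t_row, y_row)      # loop over parameters l
    )
    penalty = sum(w ** 2 for w in weights)
    return 0.5 * sse + 0.5 * alpha * penalty

def committee_estimate(member_outputs):
    """Average the per-parameter estimates of several independently initialised models."""
    n = len(member_outputs)
    return [sum(col) / n for col in zip(*member_outputs)]
```

For a single star, `committee_estimate` would receive one row of (log T_eff, log g, [Fe/H]) per committee member and return their mean.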

Our estimate of the accuracy of the model in the application
phase is the mean absolute error

$$ E = \frac{1}{N} \sum_{p=1}^{N} \bigl| C(p) - T(p) \bigr| \qquad (3) $$

where C(p) is the committee estimate, and T(p) is an independent
estimate for the p-th spectrum. For the SR and RR models
(see Sect. 5), T is an estimate based on other methods (e.g.,
SSPP), so E is not a real "error", but rather a discrepancy (as
there is no definitive "ground truth").
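
For concreteness, the mean absolute error between committee estimates and reference values can be computed as in the following minimal sketch (the function name is hypothetical, and the numbers in the usage line are invented for illustration):

```python
def mean_absolute_error(committee, reference):
    """Mean absolute difference between committee estimates C(p) and reference values T(p)."""
    if len(committee) != len(reference):
        raise ValueError("inputs must have the same length")
    return sum(abs(c - t) for c, t in zip(committee, reference)) / len(committee)

# three stars, one parameter (e.g. log g in dex)
example = mean_absolute_error([1.0, 2.0, 4.0], [1.5, 2.0, 3.0])
```

The statistic is evaluated separately for each atmospheric parameter.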

4. Dimensionality reduction 

Our initial models based on the full spectrum produced good
results, but we find that the full spectrum is not necessary (not
surprisingly, as it contains a large amount of redundant
information). Dimensionality reduction often leads to enhanced
reliability, because of the smaller number of parameters employed,
and to considerably reduced computing time. We investigated
various approaches and retained two in the present work: Principal
Component Analysis (e.g., Hastie et al. 2001; Singh et al. 2001;
Bailer-Jones, Irwin & von Hippel 1998, and references therein)
and Wavelength Range Selection (e.g., Beers et al. 1999;
Willemsen et al. 2005).

Fig. 2. Reconstruction of SDSS/SEGUE spectra by projection onto synthetic principal components. In each row, the spectrum on
the left is the original and the following show the reconstruction using increasing numbers of principal components. The residual
spectrum (original minus reconstructed) is shown in the bottom of each panel. The quoted atmospheric parameters are taken from a
preliminary version of the processing pipeline SSPP.



4.1. Principal Component Analysis (PCA)

Principal Component Analysis (PCA) linearly transforms a set
of data via a rotation of the coordinate system, and an offset
of its origin. The new axes (or principal components, the PCs)
are chosen such that the projection of the data onto each axis in
turn maximizes the variance in the data. If we have a set of n
vectors (spectra), x, of dimension N (the number of flux bins),
then formally the principal components are the eigenvectors,
u_k (k = 1 ... N), of the covariance matrix of the data. The p-th
spectrum is reconstructed using the PC basis as

$$ y_p(r) = \sum_{k=1}^{r} a_{kp}\, u_k \qquad (4) $$

where

$$ a_{kp} = x_p \cdot u_k \qquad (5) $$

are the so-called "admixture coefficients". These represent the
spectrum in the new (PC) space in the same way that the original
spectrum did in the original (flux bin) space (i.e., they can be
used as inputs in our regression models). If we set r = N then
we reconstruct the spectra exactly. If r < N we have a reduced
reconstruction, i.e., a compression which uses just the r most
significant PCs (those with the largest eigenvalues).
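
The projection and reconstruction described above can be sketched with NumPy. The function names are hypothetical, and two choices are assumptions of this sketch: the PCs are obtained from a singular value decomposition of the mean-subtracted data (equivalent to diagonalising the covariance matrix), and the offset of the origin is handled by subtracting the mean spectrum explicitly before projecting.

```python
import numpy as np

def fit_pca(X):
    """Return the mean spectrum and the PCs (rows, ordered by decreasing eigenvalue).

    X has one spectrum per row and one flux bin per column.
    """
    mean = X.mean(axis=0)   # offset of the origin
    # rows of Vt are the eigenvectors of the covariance matrix of X
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt

def admixture_coefficients(x, mean, pcs, r):
    """Project one spectrum onto the r most significant PCs."""
    return pcs[:r] @ (x - mean)

def reconstruct(a, mean, pcs):
    """Rebuild a spectrum from its admixture coefficients."""
    return mean + a @ pcs[: a.size]
```

In use, `fit_pca` would be run on the training spectra only, and `admixture_coefficients` applied to both training and evaluation spectra to produce the regression model inputs.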

If the number of spectra is smaller than the dimensionality
of the data, i.e., if n < N, then the spectra span a subspace of
dimensionality n. In this case only n PCs are defined and a full
reconstruction is achieved with r = n. With n > N, using
all PCs in the reconstruction means that any spectrum - even one
not used to form the PCs - can be reconstructed exactly. With
n < N this is no longer true. This is actually the case with our
synthetic data, where n = 1816 and N = 3818. This potentially
reduces the quality of any reconstruction, because some of the
data space is not spanned by the PCs.

Reduced spectral reconstructions for five representative
SDSS/SEGUE stars, using different numbers of eigenvectors
computed from the synthetic and real spectra, are shown in Figs.
2 and 3, respectively. The residual spectrum, defined as the
difference between the original and the reconstructed spectrum, is
shown at the bottom of each column for each pattern and each
reconstruction. From inspection of these samples, one can see
how the PCA approach acts as an effective filter to remove noise,
recover missing and/or borderline features, and to detect
outliers in a spectrum that are reconstructed with large errors (e.g.,
Storrie-Lombardi et al. 1995; Bailer-Jones, Irwin & von Hippel
1998). However, here we also note that there is evidence that
the Kurucz model spectra we have adopted do not well describe
SDSS/SEGUE spectra of cool stars (T_eff < 5000 K), especially
when a small number of PCs is assumed: the residual spectrum
of main sequence stars at T_eff = 4431 K and at T_eff = 4752 K
highlights difficulties in reconstructing, with 5+5 and 25+25
PCs, the C2 band at 5165 Å (see Fig. 2).



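A useful measure of the reconstruction error over a set of P
spectra is Q(r), defined (Eq. 6) from the residuals between the
reconstructed spectra y_p(r) and the originals x_p, taken over all
P spectra (p = 1 ... P) and all N flux bins (i = 1 ... N).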

Figure 4 shows how this error varies with r. Note that while
the PCs themselves are constructed using the training data set,
Q(r) is calculated on a different set (namely the set to which the
regression model is later applied). The three cases show quite
different behaviour. For SS the error drops quite rapidly with
increasing r, settling to a constant (but non-zero) gradient after
about 50 PCs (from a total of 1816), whereas for SR and RR
the gradient of the curve becomes constant after including just
a few PCs. The main reason is that real spectra (used either to
form the PCs or in the projection) show much more variance
than synthetic spectra, and this is spread over more data
dimensions. A second observation is that the larger the noise, the
larger the reconstruction error at a given r. For further discussion
see Bailer-Jones (1996) or Bailer-Jones, Irwin & von Hippel (1998).



It is interesting, however, that the curve for SR "levels off" at 
such a low value of r. This may well be a result of the fact, men- 
tioned above, that the PCs only span a subspace of the original 
data space. 

In summary, a PCA compression retains those spectral
features which are most common across the data set. It
preferentially removes noise (and rare features), because they are
statistically uncorrelated. Note that the atmospheric parameters are
not used in defining the PCs.

Thus, considering the above, the choice of the optimal num- 
ber of PCs to retain is a trade-off between retaining informa- 
tion versus reducing dimensionality and noise, and should be 
optimized in conjunction with the regression model. There exist 
more sophisticated methods of dimensionality reduction which 
could be used in the future, such as local and nonlinear
variations on PCA (see Einbeck, Evers & Bailer-Jones (2007) for a
review and astronomical application).



4.2. Wavelength Range Selection (WRS) 

The restriction of an analysis to certain wavelength intervals, via
the exclusion of (hopefully) unimportant ranges, is an
alternative way to reduce the dimensionality of the input space. This
provides a way of directly introducing domain information into
the regression model. While this selection is potentially
difficult (and the number of permutations extremely large), we show
below that this approach is particularly effective for the
estimation of the surface gravity parameter, log g. After considering a
number of alternatives, we chose to restrict the analysis to the
wavelength ranges 3900-4400 Å, 4820-5000 Å, 5155-5350 Å,
and 8500-8700 Å in the spectra. These regions contain the most
prominent hydrogen and metal lines, including Ca II K and H, the
Balmer lines Hβ, Hγ, and Hδ, the CH G-band, the Mg I b triplet,
and the Ca II triplet.
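
The selection itself amounts to masking the spectrum. A minimal sketch, using the four wavelength ranges quoted above (the names `WRS_RANGES` and `wrs_select` are hypothetical, and the inclusive treatment of the interval end points is an assumption):

```python
# wavelength ranges retained by the WRS method, in Angstrom
WRS_RANGES = [(3900.0, 4400.0), (4820.0, 5000.0), (5155.0, 5350.0), (8500.0, 8700.0)]

def wrs_select(wave, flux, ranges=WRS_RANGES):
    """Keep only the flux bins whose wavelength lies inside one of the ranges."""
    return [f for w, f in zip(wave, flux)
            if any(lo <= w <= hi for lo, hi in ranges)]

# toy spectrum: wavelengths in Angstrom with arbitrary flux labels
wave = [3850.0, 3933.7, 4861.3, 5175.0, 6563.0, 8542.0]
flux = [1, 2, 3, 4, 5, 6]
selected = wrs_select(wave, flux)
```

The retained bins would then be passed directly to the regression model in place of the PCA admixture coefficients.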

5. Results 

In this section we report the results of the three types of models
developed, SS, RR and SR (for a definition of these see Sect. 1).



5.1. SS - Synthetic vs. Synthetic

For this analysis we adopt the sample of 1816 noise-free
synthetic spectra described in Sect. 2.2. This is randomly split into
two equal-sized sets - one for model training, and one for model
evaluation.

Fig. 4. PCA spectral reconstruction error, Q (defined in
Eq. 6), on the evaluation data set for SR/RR/SS
(solid/dashed/dotted lines, respectively) as a function of the
number of eigenvectors, r, used for reconstruction.

After a preliminary analysis with the full spectra, we decided
to use a PCA pre-processing of the data (Sect. 4.1). Principal
Components are computed using the training set, then both
sets are projected onto them to yield the admixture coefficients,
which are then the regression model inputs. PCA is performed
on the red and blue spectra separately, because this gave a
better reconstruction (which in turn reduced systematic offsets in
the derived parameters). Table 1 shows typical parametrization
errors for the three stellar atmospheric parameters for different
numbers of PCs retained in the reconstruction; they are all very
small and, surprisingly, lower for log g than for [Fe/H]. We
remark that, when increasing the number of PCs, the error is
initially determined predominantly by the amount of information
present in the reconstructed spectra, then by the limited ability of
the non-linear regression model to make full use of the available
information. These results, and the analysis of the reconstructed
spectra, led us to select 25 (blue region) + 25 (red region) PCs
for the model.



Table 1. Mean absolute errors on the evaluation set of 908
spectra in the SS model for different numbers of PCs retained in the
reconstruction. (As PCA is done separately on the red and blue
regions, the total number of inputs is twice the number of PCs.)

PCs    E_log Teff    E_log g    E_[Fe/H]
5      0.0087        0.1264     0.1558
25     0.0036        0.0245     0.0327
100    0.0030        0.0251     0.0269
908    0.0133        0.2087     0.2308



The above results were obtained with noise-free data, which
is not very realistic, so we also trained models where both the
training and evaluation sets are degraded with additive Gaussian
noise to signal-to-noise ratio (SNR) levels of 10/1, 30/1, 50/1 and
100/1. Even at a SNR of 10/1, the errors are increased by only
50 K in T_eff, 0.02 dex in log g, and 0.03 dex in [Fe/H]. This
modest deterioration is on account of the artificially good
correspondence between the training and evaluation sets when using
purely synthetic data; the PCA noise filtering also appears to
help. Note that whenever we use synthetic spectra to define the
PCs, we always use noise-free spectra (also in Sect. 5.3).
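
Degrading a noise-free spectrum to a chosen SNR can be sketched as below. The noise model is an assumption of this sketch: each pixel receives zero-mean Gaussian noise with standard deviation equal to its flux divided by the SNR. The function name is hypothetical.

```python
import random

def add_noise(flux, snr, rng=None):
    """Add zero-mean Gaussian noise with per-pixel sigma = |flux| / snr (assumed model)."""
    rng = rng or random.Random()
    return [f + rng.gauss(0.0, abs(f) / snr) for f in flux]

# degrade a flat, continuum-normalised toy spectrum to SNR = 10
noisy = add_noise([1.0] * 20000, 10.0, random.Random(42))
```

Passing an explicitly seeded `random.Random` instance keeps the degraded training and evaluation sets reproducible.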



5.2. RR - Real vs. Real 

Following from our experience with the SS analysis, we build
an RR regression model to parametrize real spectra. The
training and evaluation data sets are taken from a set of 38 731 stars
from 140 SDSS/SEGUE plates, in directions of low reddening,
which have had atmospheric parameters estimated by a
preliminary version of the SSPP. Both training and evaluation sets
are drawn at random (without replacement) with sizes 19 731 and
19 000 spectra, respectively. We use 2151 pixels in the blue
spectrum between 3850-6000 Å and 1667 pixels in the red spectrum
between 6500-9000 Å. A PCA compression reduces this to 25
(blue) + 25 (red) PCs, the PCs themselves formed only from the
training set. This compresses the data to 1.3% of its former size,
resulting in more stable and faster models. We use these data
to predict log T_eff, log g, and [Fe/H]. The standard deviations
(essentially an estimate of their parameter ranges) of the input
parameter distributions are 724 K in T_eff, 0.64 dex in log g, and
0.54 dex in [Fe/H], respectively. These are on the order of the
RMS errors which a random classifier would achieve.
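
The random split and the quoted compression factor can be checked with a few lines. The seed and the function name are arbitrary choices made for this sketch; only the sample sizes and pixel counts come from the text.

```python
import random

def split_sample(n_total, n_train, seed=0):
    """Draw disjoint training and evaluation index sets at random (without replacement)."""
    idx = list(range(n_total))
    random.Random(seed).shuffle(idx)
    return idx[:n_train], idx[n_train:]

train, evaluation = split_sample(38731, 19731)

n_pixels = 2151 + 1667            # blue + red flux bins
n_inputs = 25 + 25                # blue + red PCs
compression = n_inputs / n_pixels
```

Here `compression` evaluates to about 0.013, i.e. the 1.3% quoted above.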

In addition to this purely spectral model, we developed
another model in which the four (de-reddened) photometric colours
u - g, g - r, r - i, and i - z are added as four additional model
inputs (they are not involved in the PCA).

Figure 5 compares our model estimates with those from the
SSPP on the evaluation set. Overall we see good consistency,
especially for stars with T_eff < 8000 K (log T_eff = 3.90). Above
this effective temperature we see that our models underestimate
log T_eff relative to the SSPP. Our regression models are designed
to smooth, i.e. interpolate, data. Extrapolation of the model to
estimate atmospheric parameters that are not spanned by the
training set is relatively unconstrained (and any model would need to
make additional assumptions). Furthermore, the accuracy of the
RR model is limited by the accuracy of the target atmospheric
parameters used in training, as well as their consistency across
the parameter space. In this case, the SSPP estimates are
combinations from several estimation models, each of which operates
only over a limited parameter range. Thus, the transition we see
above 8000 K may indicate a temperature region where one of
the SSPP submodels is dominating the SSPP estimates, and this
is not well-generalized by our model. Of course, if we decided
that we wanted to reproduce the SSPP predictions for hot stars,
we could do this simply by fitting a second-order polynomial to
our residuals to remove the systematic offset.

Fig. 5. Atmospheric parameters estimation with the RR model. We compare our estimated log T_eff, log g, [Fe/H] with those from
a preliminary version of the SSPP on the 19 000 stars in the evaluation set. The perfect correlation and a linear fit to the data are
shown with the solid and dashed lines respectively. The histograms of the discrepancies (our estimates minus SSPP estimates) are
shown in the lower panels.

Table 2 quantifies the overall discrepancies for each
parameter. An error in log T_eff of 0.0126 is an error of 2.9%, or 170 K
at 6000 K. The last line in the table is the performance when we
include photometry. Adding photometry leads to significant
improvement in all three atmospheric parameters. This is not
surprising for effective temperature, as the photometric calibration
of these bands is less complicated than the spectral calibration. A
more accurate T_eff will permit more accurate log g and [Fe/H].
Thus, in directions where interstellar reddening is known to be
low, photometry should be used. The values listed in the table
for a given parameter are averaged over all values of the adopted
atmospheric parameters. Results for gravity, metallicity, and
effective temperature ranges - dwarfs/giants, low/high metallicity,
and cool/warm stars - are listed in Table [?] and in Table 6.
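
The conversion from an error in log T_eff to a fractional and absolute temperature error, as quoted in the paragraph above, follows from a first-order expansion: a small error in log10(T) corresponds to a fractional error of ln(10) times that value. A short check, with a hypothetical helper name:

```python
import math

def log_error_to_fraction(err_dex):
    """First-order fractional error implied by an error (in dex) in log10 of a quantity."""
    return math.log(10.0) * err_dex

frac = log_error_to_fraction(0.0126)   # fractional error in T_eff
delta_t = frac * 6000.0                # corresponding error in K at T_eff = 6000 K
```

This gives a fractional error of about 2.9% and a temperature error of roughly 170 K at 6000 K, in line with the figures quoted in the text.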



Table 2. Mean absolute errors on the evaluation set of 19 000
spectra in the RR model (plotted in Fig. 5). The first line is for
the full data set (training and evaluation data). The second and
third are just for the evaluation sets. The third line is for a model
which included the four photometric colours as additional model
inputs (predictors).

set       PCs      +       E_log Teff    E_log g    E_[Fe/H]
38 731    25+25            0.0090        0.2699     0.1339
19 000    25+25            0.0126        0.3644     0.1949
19 000    25+25    phot    0.0082        0.2791     0.1616



5.3. SR - Synthetic vs. Real 

We have shown above that our regression models are capable of 
obtaining accurate and consistent estimates of atmospheric pa- 
rameters when trained and tested on synthetic spectra (SS), and 
also when trained on real spectra with existing parametrizations 
and applied to another sample of real spectra (RR). We now de- 
velop the hybrid approach, SR, in which we train on synthetic 
spectra and use this model to determine atmospheric parameters 
for SDSS/SEGUE spectra directly. A very important aspect of 
this model is processing the synthetic and real data to look simi- 
lar; inaccurate synthetic spectra (e.g. poor models or a poor flux 
calibration) will degrade performance and/or give rise to system- 
atic errors. 

Experience shows that it is advantageous to match the noise 
properties of the synthetic training sample to those of the real 
sample. Essentially, noise acts as a regularizer in the training 
phase and thus improves the overall generalization performance of 
the models (e.g., Snider et al. 2001; Odewahn et al. 2002), in 
particular reducing systematics. For each of the 38 731 SDSS/SEGUE 
stars in the evaluation set we use the SNR reported (for each 
pixel) in the data array included in the FITS file (which was 
estimated by the reduction pipeline). We assign a global SNR to the 
spectrum, which is the median of the per-pixel SNR over all flux 
bins in the wavelength range we retain (viz. 4000-5850 Å and 
6500-8500 Å). Figure 6 shows the distribution of these SNR values. 
Based on this, we chose to develop two regression models, one 
optimized for low SNR real spectra (SNR<35/1, 13 487 stars), 



Fig. 6. Histogram of the SNR distribution for all 38 731 stars of 
the real sample. For each of them, the value for SNR has been 
estimated from the stellar spectrum. 



the other for high SNR real spectra (SNR>35/1, 25 244 stars). 
Experimentation showed that this noise injection does indeed re- 
duce systematics which are obtained when using noise-free data 
for training. 
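
The two steps described above (assigning a global SNR to a real spectrum and injecting matching noise into a noise-free synthetic spectrum) can be sketched as follows. This is only an illustration: the function names, the Gaussian noise model with sigma = |flux| / SNR, and the toy flat spectrum are assumptions and are not taken from the actual pipeline.

```python
import numpy as np

def global_snr(snr_per_pixel, wavelength,
               ranges=((4000.0, 5850.0), (6500.0, 8500.0))):
    """Median per-pixel SNR over the retained wavelength ranges (Angstrom)."""
    wavelength = np.asarray(wavelength, dtype=float)
    snr_per_pixel = np.asarray(snr_per_pixel, dtype=float)
    keep = np.zeros(wavelength.shape, dtype=bool)
    for lo, hi in ranges:
        keep |= (wavelength >= lo) & (wavelength <= hi)
    return float(np.median(snr_per_pixel[keep]))

def add_noise(flux, snr, rng):
    """Add zero-mean Gaussian noise with sigma = |flux| / snr (assumed model)."""
    flux = np.asarray(flux, dtype=float)
    sigma = np.abs(flux) / snr
    return flux + rng.normal(0.0, 1.0, size=flux.shape) * sigma

# Toy example: a flat synthetic spectrum degraded to SNR = 20.
rng = np.random.default_rng(0)
wavelength = np.linspace(3850.0, 9000.0, 2000)
synthetic_flux = np.ones_like(wavelength)
noisy_flux = add_noise(synthetic_flux, snr=20.0, rng=rng)
```

In such a scheme one noisy training set would be generated per SNR regime, so that each regression model sees training noise comparable to that of the spectra it is later applied to.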

We explored the application of dimensionality reduction 
with PCA, but found that this led to rather large systematic 
errors in the parameters, in particular in log g (up to 1.0 dex). 
We instead found that it is better simply to select wavelength 
regions which are known to be the most sensitive to surface 
gravity (e.g. 3900-4400 Å, 4820-5000 Å, 5155-5350 Å and 
8500-8700 Å). 
This is perhaps not unexpected, since essentially all of the meth- 
ods that are used by the SSPP to define the target log g values use 
only these restricted wavelength ranges. This may also indicate 
that the gravity signature in real stars outside of the wavelength 
regions selected above behaves differently from the signature in 
the synthetic spectra. Either way, the excluded regions show less 
sensitivity to log g, so for this parameter these regions do not add 
information, only data that are uncorrelated with the parameter 
of interest (so are effectively just noise). It is also possible, of 
course, that the PCA may be filtering out subtle (weak) features 
which are strong predictors of log g. 
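
A minimal sketch of the wavelength range selection (WRS) step, assuming the spectrum is stored as a wavelength array plus a flux array. The four ranges are the gravity-sensitive regions quoted above; the function name and the array layout are hypothetical.

```python
import numpy as np

# Gravity-sensitive regions quoted in the text, in Angstrom.
LOGG_RANGES = ((3900.0, 4400.0), (4820.0, 5000.0),
               (5155.0, 5350.0), (8500.0, 8700.0))

def select_ranges(wavelength, flux, ranges=LOGG_RANGES):
    """Keep only the pixels that fall inside the given wavelength ranges.

    `flux` may be one spectrum (n_pixels,) or a stack (n_spectra, n_pixels).
    """
    wavelength = np.asarray(wavelength, dtype=float)
    flux = np.asarray(flux, dtype=float)
    mask = np.zeros(wavelength.shape, dtype=bool)
    for lo, hi in ranges:
        mask |= (wavelength >= lo) & (wavelength <= hi)
    return wavelength[mask], flux[..., mask]

# Toy example on a coarse 1 Angstrom grid.
grid = np.arange(3850.0, 9001.0, 1.0)
wave_sel, flux_sel = select_ranges(grid, np.ones_like(grid))
```

The selected pixels are then used directly as model inputs in place of principal components.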

Based on the above considerations, our final model uses PCA 
for estimating T_eff and [Fe/H] and WRS for estimating log g. A 
separate model is used for estimating each parameter (although 
the [Fe/H] model also predicts the other two, the results of which 
are disregarded). 

Figure 7 compares our model atmospheric parameter estimates 
with those from the preliminary SSPP for the 38 731 stars 
in the evaluation set. While the overall consistency between 
the two models is reasonably good, we (again) notice 
discrepancies at the extreme parameter values, in particular for T_eff. 
This is sometimes an indication that the model has not been 
well trained, i.e., it has not located a good local minimum of 
the error function (it can never be shown that the global mini- 
mum has been found with anything but an exhaustive search). 
However, there are inevitably problems with spectral mismatch, 
in the sense that the synthetic spectra do not reproduce all of 
the complexities of the spectra of real stars. The absence of sev- 
eral molecular species in the linelists for the synthetic spectra 
may also be contributing to this problem, especially for cooler 
stars where they are expected to be more important. For the de- 
termination of metallicity, we observe that our model predicts 



lower metallicities for the lowest metallicity stars. This is 
probably a consequence of the lack of synthetic samples in the range 
-4.0 < [Fe/H] < -2.5 (see Fig. 1) in our current grid. 



Table 3. Mean absolute discrepancies (between our SR model 
and SSPP) calculated on the evaluation set of 38 731 real spectra 
(see also Fig. 7). Our models use PCA pre-processing for 
estimating [Fe/H] and log T_eff and WRS pre-processing for 
estimating log g; for the latter, PCA results are shown for comparison. 
Separate models were applied for low and high SNR spectra (the 
transition being at SNR=35/1). 



Method        SNR    E_log Teff   E_log g   E_[Fe/H]
PCA (25+25)   all    0.0138       0.4288    0.2606
              low    0.0143       0.7549    0.3023
              high   0.0136       0.3465    0.2384
WRS           all                 0.4459
              low                 0.4495
              high                0.4450





Table 3 shows the global results (averaged over all stars 
and atmospheric parameters). It is interesting that the WRS 
pre-processing results in little difference in the log g discrepancy 
for the low and high SNR regimes. Results for gravity, 
metallicity, and effective temperature ranges - dwarfs/giants, low/high 
metallicity, and cool/warm stars - are listed in Table 7 and in 
Table 8 and visualized in Fig. 8. We note that, in the estimation 
of log g, a systematic difference (our model predictions lower 
than SSPP) occurs in the range T_eff = 5600-6700 K for 
low-metallicity giants. Unfortunately we cannot include photometry 
in the SR models, because the synthetic colours are not yet 
well-calibrated, and their zero points on the AB system are still under 
discussion. 
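
The binned residual statistics shown in Fig. 8 (mean residual and mean absolute residual per bin) can be computed along the following lines. This is a sketch with hypothetical function and variable names and toy numbers; the bin edges used for the actual figure are not given in the text.

```python
import numpy as np

def binned_residuals(x, residual, edges):
    """Mean residual (bias) and mean absolute residual (scatter) per bin of x."""
    x = np.asarray(x, dtype=float)
    residual = np.asarray(residual, dtype=float)
    edges = np.asarray(edges, dtype=float)
    # Bin k holds values with edges[k] <= x < edges[k + 1].
    idx = np.digitize(x, edges) - 1
    centres, mean_res, mean_abs = [], [], []
    for k in range(len(edges) - 1):
        sel = idx == k
        if not sel.any():
            continue  # skip empty bins
        centres.append(0.5 * (edges[k] + edges[k + 1]))
        mean_res.append(residual[sel].mean())
        mean_abs.append(np.abs(residual[sel]).mean())
    return np.array(centres), np.array(mean_res), np.array(mean_abs)

# Toy example: residuals (model minus SSPP) in log g, binned in log T_eff.
log_teff = np.array([3.70, 3.72, 3.80, 3.82])
res_logg = np.array([0.1, -0.1, 0.2, 0.4])
centres, bias, scatter = binned_residuals(log_teff, res_logg,
                                          edges=[3.65, 3.75, 3.85])
```

Applying the same function separately to the low and high metallicity subsamples would give the two sets of curves described in the figure caption.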



5.4. Comparison of RR and SR 

The RR and SR models developed above both appear to give 
reasonable predictions, as measured by their mean accuracies 
with respect to the SSPP predictions - log T_eff with a residual of 
0.013/0.014 (~170 K), log g with a residual of 0.36/0.45 dex 
and [Fe/H] with a residual of 0.19/0.26 dex for RR/SR 
respectively. 
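
The correspondence between a residual in log T_eff and a temperature in kelvin follows from first-order error propagation, dT ≈ ln(10) × T × d(log T). The short sketch below evaluates this for the two quoted residuals; the function name and the reference temperatures are illustrative assumptions, since the text here does not state which reference temperature underlies the ~170 K figure.

```python
import math

def dlogt_to_kelvin(dlogt, teff):
    """First-order propagation: dT = ln(10) * T * d(log10 T)."""
    return math.log(10.0) * teff * dlogt

# Evaluate the RR/SR residuals at a few assumed reference temperatures.
for teff in (5000.0, 5500.0, 6000.0):
    rr = dlogt_to_kelvin(0.013, teff)
    sr = dlogt_to_kelvin(0.014, teff)
    print(f"T_eff = {teff:.0f} K: RR ~ {rr:.0f} K, SR ~ {sr:.0f} K")
```

The result scales linearly with the adopted reference temperature, so the same logarithmic residual corresponds to a larger error in kelvin for warmer stars.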

The global discrepancies are larger with SR for log g and 
[Fe/H], but this is not surprising because it is entirely indepen- 
dent of the SSPP parameter estimates. While the synthetic spec- 
tra place a limit on the performance of the SR model, this is true 
of any parametrization model. Physical parameters can only be 
derived using physical models; none can be measured "directly". 
The advantage of the SR approach is that it uses only one set of 
synthetic models in the parametrizations, it can easily be retrained 
using new synthetic spectra, and it provides a quick, general model 
which operates over the entire parameter range. In effect, the work in getting 
good predictions is taken out of the machine learning model and 
moved to the definition of the templates and the pre-processing. 

We find that PCA delivers more accurate atmospheric parameters 
when the training data are the actual SDSS spectra with 
previously estimated parameters, whereas WRS appears superior 
for the estimation of log g via synthetic templates, especially 
for lower SNR spectra. 

Fig. 7. Atmospheric parameters estimation with the SR model. Comparison between our derived log T_eff, log g, [Fe/H] and those 
estimated by a preliminary version of SSPP for a set of 38 731 stars. The perfect correlation and a linear fit to the data are shown 
with the solid and dashed lines respectively. The distributions of the residuals (model minus SSPP) are shown in the bottom panels. 

Fig. 8. More detailed visualization of the SR model discrepancies (Fig. 7). The diamonds joined by lines show the mean absolute residual 
(solid lines) and mean residual (dashed lines) for low metallicity ([Fe/H] < -1.5, white lines) and high metallicity ([Fe/H] > -1.5, 
grey lines) averaged over all stars in a bin which has the diamond point as its centre. The mean residual traces the systematic offset 
(bias), the mean absolute residual the scatter. 



From the subsample of 19 000 stars used as the evaluation set 
in RR we compare the SR predictions with the RR predictions 
(Fig. 9). The mean absolute differences are on the order of 0.010 
in log T_eff (150 K), 0.35 dex in log g, and 0.22 dex in [Fe/H]. 



6. Application: Globular Clusters 

Comparison of theoretical isochrones with data from clusters of- 
fers an excellent opportunity to test the present model predic- 
tions. In particular, we can use them to assess the calibration of 
the parameter determinations. Here we focus our discussion on 
the globular cluster M 15, but we have also analysed the glob- 
ular clusters M 13 and M 2 and the open cluster NGC 2420, all 
observed by SDSS/SEGUE. We select likely members, then es- 
timate their atmospheric parameters, and overplot these on a set 
of isochrones fixed at literature values for the cluster distance 
Fig. 9. Comparison between SR and RR estimations on the 19 000 real spectra in common in their evaluation sets. The line shows 
the perfect correlation and the bottom panels the distributions of residuals. 



modulus, age, and metallicity. If these values (and the isochrones 
themselves) are correct, discrepancies between our estimates and 
the isochrones would indicate problems in the calibrations of the 
atmospheric parameters (e.g. of the synthetic spectra on which 
the regression models are based). We note that Lee et al. (2007) 
have looked more carefully at the three globular clusters, and 
make an independent target selection based also on stellar 
densities, from which they derive mean metallicities and radial 
velocities for the clusters. 



6.1. M 15 - Selection 

The globular cluster M 15 is located in the sky at 
RA=21h 29m 58.3s, DEC=+12° 10' 01" (Harris 1996), and 
has been extensively studied in the past (e.g., Sandage 1970; 
Binney & Merrifield 1998). SDSS/SEGUE plates 1960 and 
1962 include observations of its members. Figure 10 shows 
the distribution of the 526 stars with available SDSS/SEGUE 
spectroscopy and ugriz photometry. The central regions of 
the clusters are not generally available for spectroscopic 
observation, due to fibre placement restrictions in the SDSS 
spectrographs. This must be borne in mind when interpreting 
the results we describe below. 

Based on position, we initially select 133 candidate members 
in the region 322.25° < α < 322.75° and 11.90° < δ < 12.40°, 
as represented by the box shown in Fig. 10. 



The distribution of the atmospheric parameters [Fe/H] 
versus log g of this sample, using both the RR and SR models, is 
shown in Fig. 11. The stars clearly fall into two groups, due to 






false cluster members which we can plausibly take to be stars 



Fig. 10. Distribution on the sky of the 526 stars present from 
SDSS/SEGUE plates 1960 and 1962. The box defines the 
selection criteria (322.25° < α < 322.75° and 11.90° < δ < 12.40°) 
which produce 133 M 15 candidates. 



projected in front of the cluster from the Galactic field (gener- 
ally at higher metallicity), and stars from the globular cluster 
itself (lower metallicity). It is also obvious that, given the ap- 
parent magnitude limits of SDSS/SEGUE, we would not expect 
to see higher-gravity main sequence stars that are true cluster 
members. 
Fig. 11. Distribution of [Fe/H] vs. log g for the 133 
positionally-selected M 15 candidates from Fig. 10. Atmospheric parameters 
are from the RR (top) and SR (bottom) models. Of these 133 
candidates, we retain only 46 (RR) or 45 (SR) stars in the 
low-metallicity group as likely cluster members. Among these, 8 
(RR) or 7 (SR) are identified as main sequence stars (asterisks) and 
40 by radial-velocity selection (filled dots). Those also selected 
as members in a preliminary analysis are highlighted in gray; 
members found to be doubtful (due to their measured abundances 
or lack of any metallicity estimate) are marked with a plus sign. 



To obtain a cleaner sample of likely cluster members, 
we select from the observed sample using published estimates 
of radial velocities and metallicities for the cluster (see Table 4). 

We first select based on radial velocity; specifically, we 
retain as candidates only those stars with -126 km s^-1 < V_r < 
-100 km s^-1. This cut preferentially excludes metal-rich main 
sequence stars, and results in a remaining sample that contains 
40 candidates with [Fe/H] < -1.5 out of a total of 42. 
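
The positional box and the radial velocity cut described for M 15 amount to two boolean masks. The sketch below applies them to a toy catalogue; the limits are those quoted in the text, while the function name, the array names, and the four example stars are hypothetical.

```python
import numpy as np

def select_candidates(ra, dec, vr,
                      ra_range=(322.25, 322.75),
                      dec_range=(11.90, 12.40),
                      vr_range=(-126.0, -100.0)):
    """Return (in_box, in_box_and_vr) boolean masks.

    ra, dec are in degrees; vr is the radial velocity in km/s.
    """
    ra, dec, vr = (np.asarray(a, dtype=float) for a in (ra, dec, vr))
    in_box = ((ra > ra_range[0]) & (ra < ra_range[1]) &
              (dec > dec_range[0]) & (dec < dec_range[1]))
    in_vr = (vr > vr_range[0]) & (vr < vr_range[1])
    return in_box, in_box & in_vr

# Toy catalogue of four stars (values invented for illustration).
ra = [322.30, 322.50, 322.90, 322.60]
dec = [12.00, 12.10, 12.00, 12.30]
vr = [-110.0, -20.0, -105.0, -125.0]
in_box, kinematic = select_candidates(ra, dec, vr)
```

The same function could be reused for the other clusters by passing the corresponding ranges from Table 4.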

We define a second sample, now of main sequence stars; 
namely, the 8 or 7 stars (for RR/SR respectively) having metal 
abundance [Fe/H] < -1.5 and log g > 3.5, without any radial 
velocity selection. Using the absolute magnitude determination 
for late-type dwarfs as a function of SDSS photometry (Bilir 
et al. 2005; Eq. 7), this second sample shows a distance modulus 
(m - M) = 14.67, in agreement with the typical value 
(m - M)_M15 = 14.93 reported by Sandage (1970). 

The complete sample of M 15 cluster members has 46 (RR)/ 
45 (SR) stars. The entire radial velocity selected sample is shown 
in Fig. 11 with filled circles, while the metal-poor main sequence 
stars we suspect are cluster members are shown with asterisks. 

M 15 has been previously analysed during the course of the 
development of the SSPP. Our initial sample (of 133 stars) 
includes 26 of the 35 candidates from that analysis. Of these, 7 stars 
which have been rejected on the grounds of their apparently 
discrepant estimated abundance, or lack of an estimate at all, are 
marked with a plus sign. The 19 stars confirmed as likely members 
are also identified as part of our candidate members; we highlight 
these as light-colour dots in Fig. 11. Inspection of this figure 
shows the M 15 members as a clump of stars, albeit one which is 
tighter in the RR predictions of the atmospheric parameters 
than in the SR predictions. 

From the sample of cluster members with consistent 
metallicities and radial velocities we obtain a mean metallicity of 
[Fe/H] = -2.20 ± 0.11 dex (RR)/[Fe/H] = -2.26 ± 0.26 dex 
(SR). Using just the giants in this sample (i.e., excluding the 
metal-poor main sequence stars) we obtain [Fe/H] = -2.20 ± 
0.10 dex (RR)/[Fe/H] = -2.35 ± 0.14 dex (SR). These values 
are in good agreement with previous determinations in the 
literature (see Table 4). 

6.2. M 15 - Isochrones 

We now compare our atmospheric parameter estimates with 
theoretical SDSS isochrones from Girardi et al. (2004). We adopt 
an age of 13.2 Gyr, a metallicity [Fe/H] = -2.22 dex, and a 
distance modulus of 14.93 mag (e.g., Sandage 1970; Binney & 
Merrifield 1998). 
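
Reading the adopted distance modulus as a magnitude (the usual convention), it converts to a distance through m - M = 5 log10(d / 10 pc). The sketch below applies this standard relation to the two values quoted for M 15; the function name is hypothetical, and no extinction correction is applied, which is an assumption of this illustration.

```python
def distance_pc(distance_modulus):
    """Distance in parsec from m - M = 5 * log10(d / 10 pc), no extinction."""
    return 10.0 ** (distance_modulus / 5.0 + 1.0)

# Literature value adopted for the isochrones, and the value derived
# from the metal-poor main sequence sample.
for label, mu in (("adopted", 14.93), ("main sequence sample", 14.67)):
    print(f"{label}: m - M = {mu:.2f} -> d = {distance_pc(mu) / 1000.0:.1f} kpc")
```

A difference of a few tenths of a magnitude in the distance modulus therefore corresponds to a difference of roughly ten per cent in distance.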

Figure 12 shows the colour-magnitude and effective 
temperature-gravity diagrams for the likely M 15 members 
overplotted with the theoretical isochrones. These isochrones bracket 
the candidates reasonably well in the colour-magnitude 
diagram, but the distribution in the atmospheric parameter plane 
shows systematic offsets, in particular for the RR model 
estimates. A zero-point offset in either the gravity or temperature 
parameterizations (or in the isochrones) would improve the 
coincidence. On the other hand, the RR model clearly yields a 
tighter distribution in the atmospheric parameters. Thus, if we 
believe the isochrones, then we can conclude that the RR model 
obtains more precise parameter estimates, while the SR model 
obtains more accurate ones. In fact, if we attribute the 
offset entirely to gravity, we would have to apply 
corrections of about 0.60 dex (RR) or 0.25 dex (SR) to our estimates in 
order to obtain coincidence with their predicted location in the 
effective temperature-gravity plane. 
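
The distinction drawn here between precision (small scatter) and accuracy (small systematic offset) can be made concrete with a few summary statistics of the residuals against a reference. The numbers below are invented purely to illustrate the two cases; they are not measurements from the cluster samples, and the function name is hypothetical.

```python
import numpy as np

def residual_stats(estimate, reference):
    """Bias (mean residual), scatter (sample std) and mean absolute residual."""
    r = np.asarray(estimate, dtype=float) - np.asarray(reference, dtype=float)
    return {"bias": float(r.mean()),
            "scatter": float(r.std(ddof=1)),
            "mae": float(np.abs(r).mean())}

# Reference log g values (e.g. read off an isochrone), all set to 3.0 here.
reference = np.full(5, 3.0)
# Case A: tight but offset estimates (precise, not accurate).
est_a = reference + 0.6 + np.array([-0.02, -0.01, 0.0, 0.01, 0.02])
# Case B: unbiased but widely scattered estimates (accurate, not precise).
est_b = reference + np.array([-0.4, -0.2, 0.0, 0.2, 0.4])

stats_a = residual_stats(est_a, reference)
stats_b = residual_stats(est_b, reference)
```

In this picture a zero-point correction equal to the bias would bring case A into agreement with the reference, whereas no single offset reduces the scatter of case B.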

6.3. Other Clusters 

We carried out the same analysis for three additional clusters 
which have also been extensively studied in the past, and so have 
reasonably consistent determinations of metallicity, age, and dis- 



globular clusters M 13 (e.g., Sandage 1970; Lupton et al. 1987; 
Shetrone 1994; Harris 1996; Binney & Merrifield 1998) appear 



M_g = 5.79 (g - r) + 1.242 (r - z) + 1.412    (7) 



cluster M 2 (e.g., Harris 1996; Lazaro et al. 2006) on SEGUE plate 
1961; and from the open cluster NGC 2420 (e.g., McClure et al. 
Fig. 12. The left panel shows the colour-magnitude diagram for M 15, and the two other panels the distribution of atmospheric 
parameters log g vs. log T_eff from the RR (middle) and SR (right) models. Of the stars selected as candidates, the asterisks denote 
main sequence metal-poor stars, the filled dots the members based on the radial velocity constraint. Confirmed and doubtful members 
assigned in a preliminary analysis are coloured gray and marked with a plus sign respectively. Overplotted are isochrones from Girardi 
et al. (2004) with metallicities and ages which bracket the values given in Table 4, i.e. at Z = 0.0001 (solid), Z = 0.0004 (dashed) 
for ages of 12.59 Gyr (black) and 14.13 Gyr (gray). 



1974; Smith 1987; Tianxing 1987) on SEGUE plates 2078 and 
2079. For each of these, we select likely members following the 
same procedures as for the M 15 analysis (Sect. 6.1) and 
compare them with isochrones with parameters based on previous 
analyses. 



Figure 13 shows the distribution of the atmospheric 
parameters for expected members of each cluster in the 
colour-magnitude and in the (log T_eff, log g) plane, overplotted with 
the theoretical isochrones selected to best match each cluster's 
properties. Inspection of these distributions confirms our 
previous conclusions for the case of M 15 - (1) there exists a 
systematic offset in effective temperature and/or surface gravity 
between the estimated parameters and those expected from the 
theoretical isochrones, and (2) the RR model provides more precise 
atmospheric parameter estimates, while the SR model provides 
more accurate ones. 

We are limited by the small number of likely cluster mem- 
bers in some cases, especially for M 2, which (so far) appears on 
only one SEGUE plate. However, it seems that this evidence is 
more clearly visible in the globular clusters which, as for M 15, 
are old and metal poor. In the atmospheric parameter plane, the 
distribution for the open cluster NGC 2420 from the SR model 
is harder to interpret. It is plausible that this cluster is too 
metal-rich to obtain good atmospheric parameter estimates, as the 
expected parameters are at the extreme of the regions covered by 
the synthetic grid used for training. Larger uncertainties are 
certainly present in this range of metallicity (see Tables 7 and 8). 
These limitations are under study at the moment. 



7. Summary and conclusions 

We have developed models to estimate the three primary 
stellar atmospheric parameters (T_eff, log g, and [Fe/H]) from 
SDSS/SEGUE spectra. These models produce self-consistent 
parameter estimates and can be implemented into an automated 
data processing pipeline. Our models rely on an initial configura- 
tion (or "training") phase, which for one of the models (RR) uses 
pre-classified observed data, for the other (SR) synthetic spectra 



selected by the user. Both are flexible, in that new models can 
easily be introduced by changing the set of training templates. 

Both models are nonlinear, regularized regression models. 
The RR model uses an initial PCA compression of the data to 
reduce the dimensionality (from 3818 to 50), thus producing a 
more robust (and precise) parametrizer (which reduces the 
dimensionality further to 3, i.e., the three atmospheric parameters). 
Both models are also rapid, requiring of the order of one 
millisecond per star on a single, modest CPU. 
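
The PCA compression step, and the reconstruction error that can be used to flag spectra unlike the training set, can be sketched with a plain singular value decomposition. This is an illustration only: the function names are hypothetical, the toy data are random numbers with small dimensions chosen for speed rather than the 3818 pixels and 50 components of the real spectra, and the nonlinear regression that follows the compression is not reproduced here.

```python
import numpy as np

def fit_pca(X, n_components):
    """Return the mean spectrum and the leading principal components."""
    mean = X.mean(axis=0)
    # Rows of vt are orthonormal directions ordered by decreasing variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def compress(X, mean, components):
    """Project spectra onto the principal components (the model inputs)."""
    return (X - mean) @ components.T

def reconstruction_error(X, mean, components):
    """Per-spectrum RMS difference between a spectrum and its PCA reconstruction."""
    X_hat = compress(X, mean, components) @ components + mean
    return np.sqrt(np.mean((X - X_hat) ** 2, axis=1))

# Toy training set: 200 "spectra" of 300 pixels built from 5 latent factors.
rng = np.random.default_rng(1)
basis = rng.normal(size=(5, 300))
X_train = rng.normal(size=(200, 5)) @ basis + 0.01 * rng.normal(size=(200, 300))
mean, components = fit_pca(X_train, n_components=5)

scores = compress(X_train, mean, components)            # shape (200, 5)
err_train = reconstruction_error(X_train, mean, components)
# A spectrum unlike the training set reconstructs poorly.
outlier = 3.0 * rng.normal(size=(1, 300))
err_outlier = reconstruction_error(outlier, mean, components)
```

A threshold on the reconstruction error then gives a simple way to reject spectra that the compressed representation cannot describe.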

The RR model has the advantage that exactly the same type 
of data are used in the training and application phases, thus 
eliminating the issue of discrepancies in the flux calibration or 
cosmic variance of the two samples. Of course, this requires 
an independent estimation method ("basis parameterizer") to 
parametrize the training templates (which itself must use 
synthetic models at some level). Our regression model then 
automates and - more importantly - generalizes this basis 
parameterizer. Indeed, the basis parameterizer may even comprise multiple 
algorithms, perhaps operating over different parameter ranges 
or used in a voting system to estimate atmospheric 
parameters. This is true in the present case, where the basis 
parameterizer comes from a preliminary version of the SDSS/SEGUE 
Spectroscopic Parameter Pipeline (SSPP; Beers et al. 2006; Lee 
et al. 2007). 

In contrast, our SR model is trained directly on synthetic 
spectra, dispensing with the need for a basis parameterizer. For 
best results these training data should have noise properties sim- 
ilar to the observed data (which improves the regularization). 
We therefore implemented different models for different SNR 
ranges. PCA is again used for data compression, except for the 
surface gravity parameter log g, where better results were ob- 
tained using a subset of spectral features known to be most sen- 
sitive to this parameter. 

For each atmospheric parameter, the accuracies of our 
predictions with respect to previous estimates (SSPP) are T_eff to 
170/170 K, log g to 0.36/0.45 dex and [Fe/H] to 0.19/0.26 dex 
for methods RR and SR respectively. Consistency between the 
two approaches is on the order of 150 K in T_eff, 0.35 dex in log g, 
Fig. 13. As Fig. 12. Top: M 13 (globular cluster) candidates. Centre: M 2 (globular cluster) candidates. Bottom: NGC 2420 (open 
cluster) candidates. For the globular clusters, isochrones are shown at Z = 0.0001 (solid), Z = 0.0004 (dashed) for ages of 12.59 Gyr (black) 
and 14.13 Gyr (gray); for the open cluster, isochrones are shown at Z = 0.004 (solid), Z = 0.008 (dashed) for ages of 3.162 Gyr (black) and 
3.548 Gyr (gray). 



Table 4. Globular/Open Clusters, literature values. The selection constraints applied for identification of likely members are labeled 
with *. 



Cluster    (RA, DEC)                      RA*             DEC*          [Fe/H]  age    m-M    RV         RV*
                                          (degree)        (degree)      (dex)   (Gyr)  (mag)  (km s^-1)  (km s^-1)
M 15       21h 29m 58.3s, +12° 10' 01"    322.25, 322.75  11.90, 12.40  -2.22   13.2   14.93  -110       -126, -100
M 13       16h 41m 41.5s, +36° 27' 37"    250.00, 250.90  36.10, 36.90  -1.70   12.7   14.07  -250       -262, -243
M 2        21h 33m 29.3s, -00° 49' 23"    323.10, 323.60  -1.05, -0.60  -1.53   13.0   10.49             -20, 20
NGC 2420   07h 38m 24.0s, +21° 34' 27"    114.40, 115.10  21.20, 22.10  -0.50   3/4    11.40  73         50, 86



and 0.22 dex in [Fe/H]. Some discrepancies are probably due 
to the different Kurucz models adopted in our SR model and in 
some of the methods employed in the SSPP. 

As a test of our model predictions, we estimated atmospheric 
parameters for globular/open cluster members and compared 
these to theoretical isochrones. We found that RR gives more 
precise parameter estimates (stars show smaller scatter) whereas 
SR gives more accurate ones (stars show smaller offset, or bias). 
We can use this information to improve the parameter calibration 
of the basis parametrizers or the pre-processing of the synthetic 
spectra. We have also used our models to estimate atmospheric 



parameters for 89 600 SEGUE and 194 172 SDSS (DR-5) stellar 
spectra, which are being used for further scientific investigations. 

We found that the inclusion of the four SDSS photometric 
colours improves the precision of parameter estimation signifi- 
cantly, but this will only work for zero (or very low) extinction 
regions. In principle, our models can be extended to predict ex- 
tinction (by inclusion of its variance in the training set), allowing 
us to then use both photometry and spectroscopy to predict atmo- 
spheric parameters along significantly reddened lines of sight. 

Our RR model has already been successfully integrated into 
the SSPP. The SR model will undergo further refinement with improved 
synthetic spectra. In particular, models with more molecules 
included in the linelists will improve the representation of cool 
stars. An extension to hotter stars will make the model more 
widely applicable (at present such stars can be filtered out via 
the PCA reconstruction error). Looking further ahead, the SR 
approach will form the basis for atmospheric parameter 
estimation from the very low resolution spectrophotometry (R ~ 12-40) 
to be obtained with Gaia (albeit using a more sophisticated 
and knowledge-based approach to regression, which also 
includes the accurate parallaxes and high-precision photometry 
from Gaia). Our pattern recognition approach is probably 
indispensable in such an application, because the low resolution and 
spectral purity of the spectrophotometry prevent the definition of 
traditional indices. 
Table 5. RR: Partial results. We list the mean μ and the corresponding standard deviation σ of the difference Committee-SSPP for 
each of the different stellar types and parameter ranges. 

Table 6. RR: Partial results. We list the mean μ and the corresponding standard deviation σ of the difference Committee-SSPP for 
each of the different stellar temperatures and metallicity ranges. 

Table 7. SR: Partial results. We list the mean μ and the corresponding standard deviation σ of the difference Committee-SSPP for 
each of the different stellar types and parameter ranges. 

Table 8. SR: Partial results. We list the mean μ and the corresponding standard deviation σ of the difference Committee-SSPP for 
each of the different stellar temperatures and metallicity ranges. 